What is exploratory data analysis (EDA)

  • maximize insight into a data set
  • detection of mistakes
  • detect outliers and anomalies
  • test underlying assumptions
  • determining relationships among the explanatory variables

EDA types

  • univariate non-graphical
  • multivariate non-graphical
  • univariate graphical
  • multivariate

UNG: variables and types - Python - 1

  • getting basic information
import pandas as pd
import numpy as np

url='https://raw.githubusercontent.com/urosgodnov/datasets/master/winequality-white.csv'
data = pd.read_csv(url, sep=";")
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 4898 entries, 0 to 4897
## Data columns (total 12 columns):
##  #   Column                Non-Null Count  Dtype  
## ---  ------                --------------  -----  
##  0   fixed acidity         4898 non-null   float64
##  1   volatile acidity      4898 non-null   float64
##  2   citric acid           4898 non-null   float64
##  3   residual sugar        4898 non-null   float64
##  4   chlorides             4898 non-null   float64
##  5   free sulfur dioxide   4898 non-null   float64
##  6   total sulfur dioxide  4898 non-null   float64
##  7   density               4898 non-null   float64
##  8   pH                    4898 non-null   float64
##  9   sulphates             4898 non-null   float64
##  10  alcohol               4898 non-null   float64
##  11  quality               4898 non-null   int64  
## dtypes: float64(11), int64(1)
## memory usage: 459.3 KB

UNG: variables and types - Python - 2

  • getting basic information
data.describe()
##        fixed acidity  volatile acidity  ...      alcohol      quality
## count    4898.000000       4898.000000  ...  4898.000000  4898.000000
## mean        6.854788          0.278241  ...    10.514267     5.877909
## std         0.843868          0.100795  ...     1.230621     0.885639
## min         3.800000          0.080000  ...     8.000000     3.000000
## 25%         6.300000          0.210000  ...     9.500000     5.000000
## 50%         6.800000          0.260000  ...    10.400000     6.000000
## 75%         7.300000          0.320000  ...    11.400000     6.000000
## max        14.200000          1.100000  ...    14.200000     9.000000
## 
## [8 rows x 12 columns]

UNG: variables and types - Python - 3

  • mean value is less than median value
  • large difference between 75th %tile and max values of predictors “residual sugar”,”free sulfur dioxide”,”total sulfur dioxide”
  • there are extreme values-Outliers in our data set

UNG: variables and types - Python - 4

  • analysis of dependent variable
data.quality.unique()
## array([6, 5, 7, 8, 4, 3, 9], dtype=int64)
data.quality.value_counts()
## 6    2198
## 5    1457
## 7     880
## 8     175
## 4     163
## 3      20
## 9       5
## Name: quality, dtype: int64

UG: variables - Python - 1

  • graphical check of missing values
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(data.isnull(),cbar=False,yticklabels=False,cmap = 'viridis')

UG: variables - Python - 2

  • checking the outliers
  • convenient definition of an outlier is a point which falls more than 1.5 times the interquartile range above the third quartile or below the first quartile
l = data.columns.values
number_of_columns=12
number_of_rows = len(l)-1/number_of_columns
plt.figure(figsize=(number_of_columns,4*number_of_rows))
for i in range(0,len(l)):
    plt.subplot(number_of_rows + 1,number_of_columns,i+1)
    sns.set_style('whitegrid')
    sns.boxplot(data[l[i]],color='green',orient='v')
    plt.tight_layout()

UG: variables - Python - 3

  • checking the distribution
plt.figure(figsize=(2*number_of_columns,60))
for i in range(0,len(l)):
    plt.subplot(12 + 1,number_of_columns,i+1)
    sns.distplot(data[l[i]],kde=True) 

MG: variables - Python - 1

  • multivariante analysis: linear correlation
k = 12 #number of variables for heatmap
cols = data.corr().nlargest(k, 'quality')['quality'].index
cm = data[cols].corr()
plt.figure(figsize=(10,6))
sns.heatmap(cm, annot=True, cmap = 'viridis')

Getting a report on data

#full report
from pandas_profiling import ProfileReport
prof = ProfileReport(data)
prof.to_file(output_file='Reportoutput.html')

#simple report
prof_minimal=ProfileReport(data, minimal=True)
prof_minimal.to_file(output_file="Reportoutput_min.html")

Lab

  • use mtcars dataset
  • for pyhton use:
url='https://raw.githubusercontent.com/urosgodnov/datasets/master/mtcars.csv'
data = pd.read_csv(url)
  • mpg variable is the target variable
  • perform EDA ‘manually’
  • what can you tell me about mtcars dataset? Anything unusual?

UNG: variables and types - R - 1

  • getting basic information (str)
url<-'https://raw.githubusercontent.com/urosgodnov/datasets/master/winequality-white.csv'
data <- read.csv2(url, dec=".")
str(data)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

UNG: variables and types - R - 2

  • getting basic information
summary(data)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00      Median :134.0        Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4        Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000

UNG: variables and types - R - 3

  • analysis of dependent variable
library(tidyverse)
data%>%select(quality)%>%distinct()
##   quality
## 1       6
## 2       5
## 3       7
## 4       8
## 5       4
## 6       3
## 7       9
data%>%select(quality)%>%count(quality, sort = TRUE)
##   quality    n
## 1       6 2198
## 2       5 1457
## 3       7  880
## 4       8  175
## 5       4  163
## 6       3   20
## 7       9    5

UG: variables - R - 1

  • graphical check of missing values
  • using naniar library
library(naniar)
gg_miss_upset(airquality)

UG: variables - Python - 2

  • checking the outliers
df<-data%>%pivot_longer(cols=fixed.acidity:quality,names_to="variable", values_to = "value")

ggplot(data = df, aes(x=variable, y=value)) + 
              scale_y_log10() +
              geom_point() +
              geom_boxplot(outlier.colour='blue')+
              coord_flip()

UG: variables - R - 3

  • checking the distribution
  ggplot(df,aes(value)) +
    facet_wrap(~ variable, scales = "free") +
    geom_histogram() 

MG: variables - R - 1

  • multivariante analysis: linear correlation
library(PerformanceAnalytics)

chart.Correlation(data)

Getting a report on data

#full report
library("dataMaid")

makeDataReport(data,
               render = TRUE,
               output = "html",
               file = "FullReport.html", 
               replace = TRUE,
               reportTitle ="Full report on dataset")
#simple report
library("summarytools")
view(dfSummary(data))

#interactive shiny report
library("explore")
explore(data)

Lab

  • use airquality dataset
  • Ozone is the target variable
  • perform EDA ‘manually’
  • what can you tell me about the airquality dataset? Anything unusual?